Unsupervised Learning Project

Data Description:

The data contains features extracted from the silhouette of vehicles in different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

Domain:

Object recognition

Context:

The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:

 All the features are geometric features extracted from the silhouette.  All are numeric in nature.

Learning Outcomes:

 Exploratory Data Analysis  Reduce number dimensions in the dataset with minimal information loss  Train a model using Principle Components

Objective:

Apply dimensionality reduction technique – PCA and train a model using principle components instead of training the model using just the raw data.

Steps and tasks:

  1. Data pre-processing – Perform all the necessary preprocessing on the data ready to be fed to an Unsupervised algorithm (10 marks)
  2. Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why (10 points)
  3. Split the data into train and test (Suggestion: specify “random state” if you are using train_test_split from Sklearn) (5 marks)
  4. Train a Support vector machine using the train set and get the accuracy on the test set (10 marks)
  5. Perform K-fold cross validation and get the cross validation score of the model (10 marks)
  6. Use PCA from Scikit learn, extract Principal Components that capture about 95% of the variance in the data – (10 points)
  7. Repeat steps 3,4 and 5 but this time, use Principal Components instead of the original data. And the accuracy score should be on the same rows of test data that were used earlier. (hint: set the same random state) (10 marks)
  8. Compare the accuracy scores and cross validation scores of Support vector machines – one trained using raw data and the other using Principal Components, and mention your findings (5 points)
In [1]:
#Import all the necessary modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats

from sklearn import tree
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from warnings import simplefilter # import warnings filter
simplefilter(action='ignore', category=FutureWarning) # ignore all future warnings
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
In [2]:
# Load dataset
df= pd.read_csv("vehicle-1.csv")
df.head(10)
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
5 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
6 97 43.0 73.0 173.0 65.0 6 153.0 42.0 19.0 143 176.0 361.0 172.0 66.0 13.0 1.0 200.0 204 bus
7 90 43.0 66.0 157.0 65.0 9 137.0 48.0 18.0 146 162.0 281.0 164.0 67.0 3.0 3.0 193.0 202 van
8 86 34.0 62.0 140.0 61.0 7 122.0 54.0 17.0 127 141.0 223.0 112.0 64.0 2.0 14.0 200.0 208 van
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car

1.Data pre-processing

In [3]:
df.shape
Out[3]:
(846, 19)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [5]:
# All columns have numerical values
# Class would be the target variable. Should be removed when PCA is done
In [6]:
df.isna().sum()
Out[6]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

There are various ways to handle missing values. Drop the rows, replace missing values with median values etc. from above we can say that we have NAN or 0 in the column. We could drop those rows - which might not be a good idea under all situations. Here, we will replace them with their median values.

In [7]:
df['skewness_about'].unique()
Out[7]:
array([ 6.,  9., 14.,  5., 13.,  3.,  2.,  4.,  8.,  0.,  7.,  1., 10.,
       17., 20., 18., nan, 11., 16., 21., 12., 22., 15., 19.])
In [8]:
df = df.replace('0', np.nan)
In [9]:
# replace the missing values with median value.
# Note, we do not need to specify the column names below
# every column's missing value is replaced with that column's median respectively
df=df.fillna(df.median())
In [10]:
df.isna().sum()
Out[10]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [11]:
df.describe().transpose()
Out[11]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.823877 6.134272 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.100473 15.741569 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.874704 33.401356 104.0 141.00 167.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.677305 7.882188 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.887707 33.197710 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.936170 7.811882 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.580378 2.588558 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.596927 31.360427 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.314421 176.496341 184.0 318.25 363.5 586.75 1018.0
scaled_radius_of_gyration 846.0 174.706856 32.546277 109.0 149.00 173.5 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.443262 7.468734 59.0 67.00 71.5 75.00 135.0
skewness_about 846.0 6.361702 4.903244 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.600473 8.930962 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.918440 6.152247 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0

Observations: Compactness and circularity has mean and median values almost similar , it signifies that it is normally distribited and has no skewness/outlier scatter_ratio feature seems to be having some kind of skewness and outlier

In [12]:
# Check for duplicate data

dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
Number of duplicate rows = 0
In [13]:
# Label Encoding
le=preprocessing.LabelEncoder()
df['class']=le.fit_transform(df['class'])
print(df.shape)
df.head()
(846, 19)
Out[13]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 2
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 2
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 1
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 2
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 0

2.Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why

A bivariate analysis among the different independent variables can be done using scatter matrix plot. Seaborn libs create a dashboard reflecting useful information about the dimensions

In [14]:
# Check for correlation of variable
df.corr(method='pearson')
Out[14]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
compactness 1.000000 0.684887 0.789928 0.689743 0.091534 0.148249 0.812620 -0.788750 0.813694 0.676143 0.762070 0.814012 0.585243 -0.249593 0.236078 0.157015 0.298537 0.365552 -0.033796
circularity 0.684887 1.000000 0.792320 0.620912 0.153778 0.251467 0.847938 -0.821472 0.843400 0.961318 0.796306 0.835946 0.925816 0.051946 0.144198 -0.011439 -0.104426 0.046351 -0.158910
distance_circularity 0.789928 0.792320 1.000000 0.767035 0.158456 0.264686 0.905076 -0.911307 0.893025 0.774527 0.861519 0.886017 0.705771 -0.225944 0.113924 0.265547 0.146098 0.332732 -0.064467
radius_ratio 0.689743 0.620912 0.767035 1.000000 0.663447 0.450052 0.734429 -0.789481 0.708385 0.568949 0.793415 0.718436 0.536372 -0.180397 0.048713 0.173741 0.382214 0.471309 -0.182186
pr.axis_aspect_ratio 0.091534 0.153778 0.158456 0.663447 1.000000 0.648724 0.103732 -0.183035 0.079604 0.126909 0.272910 0.089189 0.121971 0.152950 -0.058371 -0.031976 0.239886 0.267725 -0.098178
max.length_aspect_ratio 0.148249 0.251467 0.264686 0.450052 0.648724 1.000000 0.166191 -0.180140 0.161502 0.305943 0.318957 0.143253 0.189743 0.295735 0.015599 0.043422 -0.026081 0.143919 0.207619
scatter_ratio 0.812620 0.847938 0.905076 0.734429 0.103732 0.166191 1.000000 -0.971601 0.989751 0.809083 0.948662 0.993012 0.799875 -0.027542 0.074458 0.212428 0.005628 0.118817 -0.288895
elongatedness -0.788750 -0.821472 -0.911307 -0.789481 -0.183035 -0.180140 -0.971601 1.000000 -0.948996 -0.775854 -0.936382 -0.953816 -0.766314 0.103302 -0.052600 -0.185053 -0.115126 -0.216905 0.339344
pr.axis_rectangularity 0.813694 0.843400 0.893025 0.708385 0.079604 0.161502 0.989751 -0.948996 1.000000 0.810934 0.934227 0.988213 0.796690 -0.015495 0.083767 0.214700 -0.018649 0.099286 -0.258481
max.length_rectangularity 0.676143 0.961318 0.774527 0.568949 0.126909 0.305943 0.809083 -0.775854 0.810934 1.000000 0.744985 0.794615 0.866450 0.041622 0.135852 0.001366 -0.103948 0.076770 -0.032399
scaled_variance 0.762070 0.796306 0.861519 0.793415 0.272910 0.318957 0.948662 -0.936382 0.934227 0.744985 1.000000 0.945678 0.778917 0.113078 0.036729 0.194239 0.014219 0.085695 -0.312943
scaled_variance.1 0.814012 0.835946 0.886017 0.718436 0.089189 0.143253 0.993012 -0.953816 0.988213 0.794615 0.945678 1.000000 0.795017 -0.015401 0.076877 0.200811 0.006219 0.102935 -0.288115
scaled_radius_of_gyration 0.585243 0.925816 0.705771 0.536372 0.121971 0.189743 0.799875 -0.766314 0.796690 0.866450 0.778917 0.795017 1.000000 0.191473 0.166483 -0.056153 -0.224450 -0.118002 -0.250267
scaled_radius_of_gyration.1 -0.249593 0.051946 -0.225944 -0.180397 0.152950 0.295735 -0.027542 0.103302 -0.015495 0.041622 0.113078 -0.015401 0.191473 1.000000 -0.088355 -0.126183 -0.748865 -0.802123 -0.212601
skewness_about 0.236078 0.144198 0.113924 0.048713 -0.058371 0.015599 0.074458 -0.052600 0.083767 0.135852 0.036729 0.076877 0.166483 -0.088355 1.000000 -0.034990 0.115297 0.097126 0.119581
skewness_about.1 0.157015 -0.011439 0.265547 0.173741 -0.031976 0.043422 0.212428 -0.185053 0.214700 0.001366 0.194239 0.200811 -0.056153 -0.126183 -0.034990 1.000000 0.077310 0.204990 -0.010680
skewness_about.2 0.298537 -0.104426 0.146098 0.382214 0.239886 -0.026081 0.005628 -0.115126 -0.018649 -0.103948 0.014219 0.006219 -0.224450 -0.748865 0.115297 0.077310 1.000000 0.892581 0.067244
hollows_ratio 0.365552 0.046351 0.332732 0.471309 0.267725 0.143919 0.118817 -0.216905 0.099286 0.076770 0.085695 0.102935 -0.118002 -0.802123 0.097126 0.204990 0.892581 1.000000 0.235874
class -0.033796 -0.158910 -0.064467 -0.182186 -0.098178 0.207619 -0.288895 0.339344 -0.258481 -0.032399 -0.312943 -0.288115 -0.250267 -0.212601 0.119581 -0.010680 0.067244 0.235874 1.000000
In [15]:
fig=plt.figure(figsize=(15,12))
sns.heatmap(df.corr(),annot=True)#correlation function
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0xf51c26a848>

Observations:

1.'circularity' is highly correlated with 'max.length_rectangularity' and 'scaled_radius_of_gyration'(0.96 & 0.93)
2.'scatter_ratio' shows high significance with 'pr.axis_rectangularity' ,'max.length_rectangularity', 'scaled_variance','scaled_variance.1','scaled_radius_of_gyration' and 'distance_circulation'
3.'pr.axis_rectangularity' is positively correlated with 'scaled_variance','scaled_variance.1' and 'scaled_radius_of_gyration'
4.'scaled_variance' and 'scaled_variance.1' is also positively correlated
In [16]:
sns.pairplot(df, diag_kind='kde')   # to plot density curve instead of histogram on the diag
Out[16]:
<seaborn.axisgrid.PairGrid at 0xf51cb36348>
In [17]:
# Remove anyoutliers, standardize variables in pre-processing step
df.boxplot(figsize=(20,3))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0xf529451948>
In [18]:
# We could see few outliers here. Possible mode of imputation:
In [19]:
plt.boxplot(df['radius_ratio'])
Out[19]:
{'whiskers': [<matplotlib.lines.Line2D at 0xf52ff222c8>,
  <matplotlib.lines.Line2D at 0xf52ff22a88>],
 'caps': [<matplotlib.lines.Line2D at 0xf52ff22bc8>,
  <matplotlib.lines.Line2D at 0xf52ff24ac8>],
 'boxes': [<matplotlib.lines.Line2D at 0xf52ff208c8>],
 'medians': [<matplotlib.lines.Line2D at 0xf52ff24c08>],
 'fliers': [<matplotlib.lines.Line2D at 0xf52ff25bc8>],
 'means': []}
In [20]:
z=np.abs(stats.zscore(df))
print(z)
[[0.16058035 0.51807313 0.05717723 ... 0.31201194 0.18395733 1.45708611]
 [0.32546965 0.62373151 0.12074088 ... 0.01326483 0.45297703 1.45708611]
 [1.25419283 0.84430302 1.51914112 ... 0.14937355 0.04944748 0.03200536]
 ...
 [1.49721783 1.49676282 1.20132288 ... 0.31201194 0.72199673 0.03200536]
 [0.93303214 1.43930625 0.26064101 ... 0.17590322 0.08506238 0.03200536]
 [1.05454464 1.43930625 1.02340478 ... 0.47465032 0.75761164 1.45708611]]
In [21]:
threshold=3
print(np.where(z>3))
(array([  4,   4,   4,  37,  37,  37,  37,  44,  85, 100, 100, 100, 123,
       132, 135, 135, 135, 135, 291, 291, 321, 321, 388, 388, 388, 388,
       388, 391, 396, 513, 516, 523, 523, 623, 687, 687, 706, 706, 706,
       733, 761, 835, 835, 835], dtype=int64), array([ 4,  5, 13,  3,  4,  5, 13,  0, 11,  4,  5, 13, 14, 15,  3,  4,  5,
       13,  4,  5, 10, 11,  3,  4,  5, 10, 13,  5, 10, 15, 14,  4,  5, 14,
       10, 11,  4,  5, 13, 10, 14,  8, 10, 11], dtype=int64))
In [22]:
print(z[4][4]) #z score higher than 3
5.245642630921364
In [23]:
df=df[(z<3).all(axis=1)]
In [24]:
df.shape #new shape
Out[24]:
(824, 19)
In [25]:
df.boxplot(figsize=(20,3))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0xf52ff347c8>
In [26]:
# We could see most of the outliers are now removed.
In [27]:
plt.boxplot(df['radius_ratio'])
Out[27]:
{'whiskers': [<matplotlib.lines.Line2D at 0xf5300e0f08>,
  <matplotlib.lines.Line2D at 0xf5300e0e88>],
 'caps': [<matplotlib.lines.Line2D at 0xf5300e4ec8>,
  <matplotlib.lines.Line2D at 0xf5300e4e48>],
 'boxes': [<matplotlib.lines.Line2D at 0xf5300e0548>],
 'medians': [<matplotlib.lines.Line2D at 0xf5300eaf88>],
 'fliers': [<matplotlib.lines.Line2D at 0xf5300eaec8>],
 'means': []}

3.Split the data into train and test

In [28]:
#Let us break the X and y dataframes into training set and test set. For this we will use
#Sklearn package's data splitting function which is based on random function

from sklearn.model_selection import train_test_split
In [29]:
#now separate the dataframe into dependent and independent variables
#print("shape of new_vehicle_df_independent_attr::",X.shape)
#print("shape of new_vehicle_df_dependent_attr::",y.shape)

X = df.iloc[:,0:18].values
y = df.iloc[:,18].values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30, random_state=10)

4.Train a Support vector machine using the train set and get the accuracy on the test set

In [30]:
from sklearn import svm
clr = svm.SVC(gamma='scale')  
clr.fit(X_train , y_train)
Out[30]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [31]:
y_predict = clr.predict(X_test)
In [32]:
# Calculation of accuracy
In [33]:
print("Accuracy on training set: {:.4f}".format(clr.score(X_train, y_train)))
print("Accuracy on test set: {:.4f}".format(clr.score(X_test, y_test)))
Accuracy on training set: 0.6840
Accuracy on test set: 0.6613
In [34]:
model_score = clr.score(X_test, y_test)
print(model_score)
0.6612903225806451
In [35]:
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.DataFrame({'Method':['svm(raw data)'], 'accuracy': model_score})
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
Out[35]:
Method accuracy
0 svm(raw data) 0.66129

5.Perform K-fold cross validation and get the cross validation score of the model

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import KFold

num_folds = 50
seed = 7

kfold = KFold(n_splits=num_folds, random_state=10)
model = LogisticRegression()
results = cross_val_score(model, X, y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.94117647 0.88235294 0.94117647 0.88235294 0.94117647 1.
 0.94117647 0.94117647 0.94117647 0.88235294 1.         0.94117647
 1.         0.94117647 1.         0.94117647 0.88235294 1.
 1.         0.88235294 0.94117647 1.         1.         0.94117647
 0.9375     0.9375     0.8125     0.9375     1.         1.
 0.9375     1.         1.         1.         0.9375     0.9375
 1.         1.         0.875      1.         0.9375     1.
 1.         0.9375     1.         1.         0.875      1.
 1.         1.        ]
Accuracy: 95.654% (4.701%)
In [37]:
kfmodel_score = results.mean()
print(kfmodel_score)
0.9565441176470588
In [38]:
tempResultsDf = pd.DataFrame({'Method':['K-fold cross val(raw data)'], 'accuracy':kfmodel_score })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
Out[38]:
Method accuracy
0 svm(raw data) 0.661290
0 K-fold cross val(raw data) 0.956544

6.Use PCA from Scikit learn, extract Principal Components that capture about 95% of the variance in the data

In [39]:
# Drop class variables
df_new =df.drop(['class'], axis =1)

df_new.head()
Out[39]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207
5 107 44.0 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183
In [40]:
# Scaling The Independent Data Set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df_scaled =  sc.fit_transform(df_new)  
In [41]:
df_scaled
Out[41]:
array([[ 0.17951215,  0.53294153,  0.06812329, ...,  0.39524814,
        -0.31968164,  0.18069854],
       [-0.31519692, -0.61579044,  0.13191146, ...,  0.16829032,
         0.00673431,  0.45240582],
       [ 1.29260756,  0.86115067,  1.53525126, ..., -0.39910421,
        -0.15647366,  0.04484489],
       ...,
       [ 1.5399621 ,  1.51756894,  1.2163104 , ..., -0.96649875,
        -0.31968164,  0.72411311],
       [-0.93358326, -1.43631328, -0.25081757, ...,  1.4165583 ,
         0.16994228, -0.09100875],
       [-1.05726053, -1.43631328, -1.01627564, ...,  0.62220595,
        -0.48288961, -0.77027697]])
In [42]:
covMatrix = np.cov(df_scaled,rowvar=False)
print(covMatrix)
[[ 1.00121507  0.68217024  0.78822181  0.75243342  0.21698852  0.46386185
   0.80932301 -0.78575974  0.81064247  0.67398599  0.78796074  0.81175375
   0.57474275 -0.29127966  0.22075084  0.15984485  0.31959319  0.39657645]
 [ 0.68217024  1.00121507  0.79171052  0.64957366  0.22025961  0.541045
   0.85047918 -0.82123179  0.84735643  0.96444884  0.81157051  0.84016678
   0.92992793  0.02955186  0.14430926 -0.01286449 -0.09028883  0.06969066]
 [ 0.78822181  0.79171052  1.00121507  0.81531437  0.26136013  0.6282153
   0.90869196 -0.91114978  0.89779661  0.77220204  0.88892309  0.89205304
   0.70207339 -0.28197451  0.10625657  0.26914753  0.1611285   0.35768519]
 [ 0.75243342  0.64957366  0.81531437  1.00121507  0.67238671  0.44497519
   0.80028668 -0.85032469  0.7751962   0.58586652  0.80720638  0.78893446
   0.55961996 -0.42786646  0.05149701  0.18807334  0.43547261  0.52425001]
 [ 0.21698852  0.22025961  0.26136013  0.67238671  1.00121507  0.1651475
   0.2245388  -0.3213513   0.19504616  0.16281395  0.24594416  0.212859
   0.1754639  -0.31877894 -0.05183531 -0.02492353  0.40277468  0.41403888]
 [ 0.46386185  0.541045    0.6282153   0.44497519  0.1651475   1.00121507
   0.48339033 -0.48613393  0.48206973  0.61894741  0.41441093  0.44747418
   0.40092548 -0.30173529  0.07959367  0.14037242  0.05010063  0.36011497]
 [ 0.80932301  0.85047918  0.90869196  0.80028668  0.2245388   0.48339033
   1.00121507 -0.97424675  0.99046026  0.81016282  0.9786196   0.99422189
   0.79267033 -0.04571703  0.0659957   0.21507259  0.02839579  0.15614863]
 [-0.78575974 -0.82123179 -0.91114978 -0.85032469 -0.3213513  -0.48613393
  -0.97424675  1.00121507 -0.95172654 -0.77382024 -0.96896218 -0.95815044
  -0.76184901  0.13011463 -0.04625222 -0.18704    -0.13382239 -0.24558722]
 [ 0.81064247  0.84735643  0.89779661  0.7751962   0.19504616  0.48206973
   0.99046026 -0.95172654  1.00121507  0.81333886  0.96388157  0.98877992
   0.7895184  -0.03017169  0.07481317  0.21721795  0.00410746  0.13785563]
 [ 0.67398599  0.96444884  0.77220204  0.58586652  0.16281395  0.61894741
   0.81016282 -0.77382024  0.81333886  1.00121507  0.75394915  0.79725905
   0.86760988  0.01195312  0.13366119  0.00243975 -0.08973262  0.09994047]
 [ 0.78796074  0.81157051  0.88892309  0.80720638  0.24594416  0.41441093
   0.9786196  -0.96896218  0.96388157  0.75394915  1.00121507  0.97563533
   0.7783421  -0.03339305  0.03192684  0.20633759  0.05530118  0.13820111]
 [ 0.81175375  0.84016678  0.89205304  0.78893446  0.212859    0.44747418
   0.99422189 -0.95815044  0.98877992  0.79725905  0.97563533  1.00121507
   0.78688256 -0.03372731  0.0680658   0.2039669   0.03234933  0.14519628]
 [ 0.57474275  0.92992793  0.70207339  0.55961996  0.1754639   0.40092548
   0.79267033 -0.76184901  0.7895184   0.86760988  0.7783421   0.78688256
   1.00121507  0.17637221  0.16871663 -0.05838083 -0.20669547 -0.08709806]
 [-0.29127966  0.02955186 -0.28197451 -0.42786646 -0.31877894 -0.30173529
  -0.04571703  0.13011463 -0.03017169  0.01195312 -0.03339305 -0.03372731
   0.17637221  1.00121507 -0.08777547 -0.14013775 -0.8440902  -0.91702214]
 [ 0.22075084  0.14430926  0.10625657  0.05149701 -0.05183531  0.07959367
   0.0659957  -0.04625222  0.07481317  0.13366119  0.03192684  0.0680658
   0.16871663 -0.08777547  1.00121507 -0.04367963  0.10247713  0.08440021]
 [ 0.15984485 -0.01286449  0.26914753  0.18807334 -0.02492353  0.14037242
   0.21507259 -0.18704     0.21721795  0.00243975  0.20633759  0.2039669
  -0.05838083 -0.14013775 -0.04367963  1.00121507  0.07888897  0.20636452]
 [ 0.31959319 -0.09028883  0.1611285   0.43547261  0.40277468  0.05010063
   0.02839579 -0.13382239  0.00410746 -0.08973262  0.05530118  0.03234933
  -0.20669547 -0.8440902   0.10247713  0.07888897  1.00121507  0.89273608]
 [ 0.39657645  0.06969066  0.35768519  0.52425001  0.41403888  0.36011497
   0.15614863 -0.24558722  0.13785563  0.09994047  0.13820111  0.14519628
  -0.08709806 -0.91702214  0.08440021  0.20636452  0.89273608  1.00121507]]
In [43]:
covMatrix.shape
Out[43]:
(18, 18)
In [44]:
# Step 2- Get eigen values and eigen vector
eig_vals, eig_vecs = np.linalg.eig(covMatrix)
print('Eigen Vectors \n%s', eig_vecs)
print('\n Eigen Values \n%s', eig_vals)
Eigen Vectors 
%s [[-2.71309795e-01 -8.73254992e-02  4.12525682e-02  1.45446164e-01
   1.62106687e-01  2.18441148e-01 -2.49159797e-01 -7.57790783e-01
   3.46444453e-01  1.75790006e-01 -6.12349381e-02 -6.64587067e-03
  -1.12656376e-02 -1.14250379e-01 -1.28888167e-01 -1.06071706e-02
  -2.70493083e-02 -1.51701066e-02]
 [-2.84449778e-01  1.45592977e-01  2.01746727e-01 -2.84005677e-02
  -1.31358918e-01 -2.66296378e-02  3.88607045e-01 -8.70916948e-02
   5.00964323e-02 -1.57155989e-01  4.70879351e-02  1.86957975e-01
  -8.59039936e-03 -1.23985195e-01  2.47106065e-01  1.07325660e-01
  -6.65801402e-01 -2.90037795e-01]
 [-3.00909965e-01 -3.81921704e-02 -7.54331346e-02  1.04297062e-01
  -7.85065596e-02  2.07109277e-03 -1.08832809e-01  3.09000450e-01
   3.40704654e-01 -8.26282981e-02 -7.63669679e-01 -4.13861513e-02
   9.15491318e-03  2.30023580e-01  2.56939245e-02 -2.54978086e-02
  -9.13228637e-02  8.64694079e-02]
 [-2.75770685e-01 -1.94070827e-01 -4.15565569e-02 -2.40023782e-01
   1.26097997e-01 -1.54235750e-01 -1.39181969e-01  6.16678808e-02
   1.64116443e-01  2.61840698e-02  1.56710036e-01  1.66031259e-01
  -4.01471990e-02 -1.90877252e-01  7.24290568e-01 -3.11025989e-02
   2.07264912e-01  2.78875723e-01]
 [-1.07425438e-01 -2.48378425e-01  9.59906665e-02 -6.06286399e-01
   6.78208046e-02 -6.04052142e-01 -6.73512130e-02 -1.43021765e-01
   3.21998087e-02 -8.51737513e-02 -4.44633937e-02 -7.40928663e-02
   2.26294903e-02  1.31162017e-01 -3.24299748e-01  9.19697753e-03
  -5.78334012e-02 -1.15875540e-01]
 [-1.87593661e-01 -6.89838664e-02  1.13390083e-01  2.47207379e-01
  -7.07075527e-01 -2.50926539e-01 -4.13366161e-01 -3.44143860e-02
  -2.33403574e-01  2.48016810e-01  1.16568059e-01  9.75894150e-02
  -7.22747704e-03  9.25202033e-02  2.36823290e-02 -9.82183715e-03
  -5.18124794e-02  7.05205132e-03]
 [-3.09300104e-01  7.77590595e-02 -1.09343480e-01  7.45503616e-04
   9.17996900e-02  7.82456666e-02 -9.88610199e-02  8.93396813e-02
  -1.13950316e-01 -1.09351030e-01  1.64747597e-01 -1.06119297e-01
   8.37869416e-01 -1.44671183e-02 -7.80223561e-02 -2.69559525e-01
  -8.06770740e-02  5.64562140e-02]
 [ 3.07625144e-01 -1.88827871e-02  9.14627983e-02  7.04563606e-02
  -8.60837058e-02 -6.27785107e-02  1.00743931e-01 -2.20496348e-01
   2.58335924e-01 -2.73933871e-02  8.27983512e-02 -1.69993629e-01
   2.48548406e-01  6.50896131e-01  3.94591981e-01  2.92062980e-03
   1.13307071e-01 -2.53629570e-01]
 [-3.06190548e-01  8.95604107e-02 -1.06986751e-01  2.59269906e-02
   8.72434776e-02  8.54384050e-02 -9.96679104e-02  3.75428672e-02
  -6.02805141e-02 -1.86442821e-01  2.68793764e-01 -2.95221115e-01
  -9.87703756e-02  3.07763859e-01 -5.98584544e-02  6.83505840e-01
  -8.25440937e-02  2.82846898e-01]
 [-2.74581919e-01  1.36026776e-01  2.06024846e-01  4.66405057e-02
  -2.51340642e-01 -1.10327428e-02  3.63380015e-01 -2.44377917e-01
  -1.18311119e-01 -4.87892625e-01 -9.06943098e-02  1.52813891e-01
  -1.59153877e-02  1.92641254e-02 -8.44295973e-02 -6.02767912e-02
   5.41972253e-01  1.32989531e-01]
 [-3.02258684e-01  7.27699345e-02 -1.38921058e-01 -5.61365488e-02
   1.55894820e-01  1.04016523e-01 -1.04470107e-01  1.49119113e-01
  -1.31550117e-01  1.77485487e-01 -2.63427894e-02  2.62702934e-01
   1.76529687e-02  5.06250285e-02  6.10073550e-03  2.55696177e-01
   3.52415598e-01 -7.08250219e-01]
 [-3.06047174e-01  8.18148047e-02 -1.08958069e-01 -3.77621201e-03
   1.27685507e-01  1.08794505e-01 -9.02793020e-02  5.44945387e-02
  -1.13258498e-01 -1.28537568e-01  2.31945779e-01 -2.14480601e-01
  -4.70438848e-01  3.19624497e-01  1.12162737e-02 -6.10329812e-01
  -8.94768976e-02 -1.23443953e-01]
 [-2.58818994e-01  2.18605050e-01  2.14106558e-01 -7.12490138e-02
  -9.99171441e-03 -6.33912857e-02  4.50390921e-01  1.17122732e-01
   1.44916845e-01  6.93662332e-01  7.02842086e-02 -1.93581998e-01
   9.90354787e-03  8.22163519e-02 -1.01720531e-01 -4.84310870e-02
   1.39749120e-01  1.73078798e-01]
 [ 6.12169850e-02  5.02566961e-01 -6.80613078e-02 -1.22704233e-01
   1.42239348e-01 -1.59981646e-01 -1.11715878e-01 -3.06576865e-01
  -5.13381796e-01  1.13808140e-01 -3.98607749e-01 -2.13070598e-01
   7.27434469e-03 -2.24592401e-02  2.92122936e-01  1.74631688e-02
  -4.06382488e-02  5.51755173e-02]
 [-3.85350884e-02 -2.76841188e-02  5.48801579e-01  5.25470638e-01
   4.80362239e-01 -3.80867143e-01 -1.20520564e-01  1.33133576e-01
  -6.88620262e-02 -7.27567697e-02  1.94807365e-02  3.94362345e-03
  -2.21657093e-03 -1.88840303e-02 -3.40056176e-04  6.42457628e-03
   2.05775750e-02 -3.36412086e-02]
 [-5.96938582e-02 -9.58106036e-02 -6.81307935e-01  4.06162675e-01
   7.72026082e-02 -4.68648122e-01  3.19105830e-01 -1.35047870e-01
   4.70632907e-03  4.70807311e-02  3.64110118e-02  8.49209351e-02
  -1.34620746e-02 -3.12381050e-03 -2.81130334e-02 -1.04714708e-02
  -1.55474428e-02  1.84881952e-02]
 [-4.85531491e-02 -5.06808273e-01  7.28491258e-02 -2.63233598e-02
   1.77288656e-01  2.41188785e-01  1.88030707e-01 -1.02985937e-01
  -4.72606264e-01  1.80479432e-01 -1.50068886e-01  3.49466416e-01
   4.20406568e-02  3.67689530e-01  1.64007707e-02 -5.65475282e-03
  -1.33649148e-01  2.08614687e-01]
 [-9.76630686e-02 -5.03512924e-01  3.97299160e-02  8.92406301e-02
  -1.18943843e-01  8.55695525e-02  1.79645105e-01 -5.68161240e-03
  -1.96900455e-01 -1.96479466e-02 -1.17042218e-01 -6.65433283e-01
   3.21189614e-04 -2.99921162e-01  1.48091845e-01  4.86753092e-02
   8.27214795e-02 -2.38469237e-01]]

 Eigen Values 
%s [9.84666540e+00 3.29980710e+00 1.20351991e+00 1.12794252e+00
 8.81333028e-01 6.66553473e-01 3.40154651e-01 2.28141119e-01
 1.20177988e-01 8.74702078e-02 6.49471642e-02 4.82952224e-02
 3.15861458e-03 3.10425921e-02 2.71046496e-02 1.03175532e-02
 1.77453419e-02 1.74946687e-02]
In [45]:
# Step 2- Get eigen values and eigen vector
eig_vals, eig_vecs = np.linalg.eig(covMatrix)
print('Eigen Vectors \n%s', eig_vecs)
print('\n Eigen Values \n%s', eig_vals) #3.Find variance and cumulative variance by each eigen vector
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Eigen Vectors 
%s [[-2.71309795e-01 -8.73254992e-02  4.12525682e-02  1.45446164e-01
   1.62106687e-01  2.18441148e-01 -2.49159797e-01 -7.57790783e-01
   3.46444453e-01  1.75790006e-01 -6.12349381e-02 -6.64587067e-03
  -1.12656376e-02 -1.14250379e-01 -1.28888167e-01 -1.06071706e-02
  -2.70493083e-02 -1.51701066e-02]
 [-2.84449778e-01  1.45592977e-01  2.01746727e-01 -2.84005677e-02
  -1.31358918e-01 -2.66296378e-02  3.88607045e-01 -8.70916948e-02
   5.00964323e-02 -1.57155989e-01  4.70879351e-02  1.86957975e-01
  -8.59039936e-03 -1.23985195e-01  2.47106065e-01  1.07325660e-01
  -6.65801402e-01 -2.90037795e-01]
 [-3.00909965e-01 -3.81921704e-02 -7.54331346e-02  1.04297062e-01
  -7.85065596e-02  2.07109277e-03 -1.08832809e-01  3.09000450e-01
   3.40704654e-01 -8.26282981e-02 -7.63669679e-01 -4.13861513e-02
   9.15491318e-03  2.30023580e-01  2.56939245e-02 -2.54978086e-02
  -9.13228637e-02  8.64694079e-02]
 [-2.75770685e-01 -1.94070827e-01 -4.15565569e-02 -2.40023782e-01
   1.26097997e-01 -1.54235750e-01 -1.39181969e-01  6.16678808e-02
   1.64116443e-01  2.61840698e-02  1.56710036e-01  1.66031259e-01
  -4.01471990e-02 -1.90877252e-01  7.24290568e-01 -3.11025989e-02
   2.07264912e-01  2.78875723e-01]
 [-1.07425438e-01 -2.48378425e-01  9.59906665e-02 -6.06286399e-01
   6.78208046e-02 -6.04052142e-01 -6.73512130e-02 -1.43021765e-01
   3.21998087e-02 -8.51737513e-02 -4.44633937e-02 -7.40928663e-02
   2.26294903e-02  1.31162017e-01 -3.24299748e-01  9.19697753e-03
  -5.78334012e-02 -1.15875540e-01]
 [-1.87593661e-01 -6.89838664e-02  1.13390083e-01  2.47207379e-01
  -7.07075527e-01 -2.50926539e-01 -4.13366161e-01 -3.44143860e-02
  -2.33403574e-01  2.48016810e-01  1.16568059e-01  9.75894150e-02
  -7.22747704e-03  9.25202033e-02  2.36823290e-02 -9.82183715e-03
  -5.18124794e-02  7.05205132e-03]
 [-3.09300104e-01  7.77590595e-02 -1.09343480e-01  7.45503616e-04
   9.17996900e-02  7.82456666e-02 -9.88610199e-02  8.93396813e-02
  -1.13950316e-01 -1.09351030e-01  1.64747597e-01 -1.06119297e-01
   8.37869416e-01 -1.44671183e-02 -7.80223561e-02 -2.69559525e-01
  -8.06770740e-02  5.64562140e-02]
 [ 3.07625144e-01 -1.88827871e-02  9.14627983e-02  7.04563606e-02
  -8.60837058e-02 -6.27785107e-02  1.00743931e-01 -2.20496348e-01
   2.58335924e-01 -2.73933871e-02  8.27983512e-02 -1.69993629e-01
   2.48548406e-01  6.50896131e-01  3.94591981e-01  2.92062980e-03
   1.13307071e-01 -2.53629570e-01]
 [-3.06190548e-01  8.95604107e-02 -1.06986751e-01  2.59269906e-02
   8.72434776e-02  8.54384050e-02 -9.96679104e-02  3.75428672e-02
  -6.02805141e-02 -1.86442821e-01  2.68793764e-01 -2.95221115e-01
  -9.87703756e-02  3.07763859e-01 -5.98584544e-02  6.83505840e-01
  -8.25440937e-02  2.82846898e-01]
 [-2.74581919e-01  1.36026776e-01  2.06024846e-01  4.66405057e-02
  -2.51340642e-01 -1.10327428e-02  3.63380015e-01 -2.44377917e-01
  -1.18311119e-01 -4.87892625e-01 -9.06943098e-02  1.52813891e-01
  -1.59153877e-02  1.92641254e-02 -8.44295973e-02 -6.02767912e-02
   5.41972253e-01  1.32989531e-01]
 [-3.02258684e-01  7.27699345e-02 -1.38921058e-01 -5.61365488e-02
   1.55894820e-01  1.04016523e-01 -1.04470107e-01  1.49119113e-01
  -1.31550117e-01  1.77485487e-01 -2.63427894e-02  2.62702934e-01
   1.76529687e-02  5.06250285e-02  6.10073550e-03  2.55696177e-01
   3.52415598e-01 -7.08250219e-01]
 [-3.06047174e-01  8.18148047e-02 -1.08958069e-01 -3.77621201e-03
   1.27685507e-01  1.08794505e-01 -9.02793020e-02  5.44945387e-02
  -1.13258498e-01 -1.28537568e-01  2.31945779e-01 -2.14480601e-01
  -4.70438848e-01  3.19624497e-01  1.12162737e-02 -6.10329812e-01
  -8.94768976e-02 -1.23443953e-01]
 [-2.58818994e-01  2.18605050e-01  2.14106558e-01 -7.12490138e-02
  -9.99171441e-03 -6.33912857e-02  4.50390921e-01  1.17122732e-01
   1.44916845e-01  6.93662332e-01  7.02842086e-02 -1.93581998e-01
   9.90354787e-03  8.22163519e-02 -1.01720531e-01 -4.84310870e-02
   1.39749120e-01  1.73078798e-01]
 [ 6.12169850e-02  5.02566961e-01 -6.80613078e-02 -1.22704233e-01
   1.42239348e-01 -1.59981646e-01 -1.11715878e-01 -3.06576865e-01
  -5.13381796e-01  1.13808140e-01 -3.98607749e-01 -2.13070598e-01
   7.27434469e-03 -2.24592401e-02  2.92122936e-01  1.74631688e-02
  -4.06382488e-02  5.51755173e-02]
 [-3.85350884e-02 -2.76841188e-02  5.48801579e-01  5.25470638e-01
   4.80362239e-01 -3.80867143e-01 -1.20520564e-01  1.33133576e-01
  -6.88620262e-02 -7.27567697e-02  1.94807365e-02  3.94362345e-03
  -2.21657093e-03 -1.88840303e-02 -3.40056176e-04  6.42457628e-03
   2.05775750e-02 -3.36412086e-02]
 [-5.96938582e-02 -9.58106036e-02 -6.81307935e-01  4.06162675e-01
   7.72026082e-02 -4.68648122e-01  3.19105830e-01 -1.35047870e-01
   4.70632907e-03  4.70807311e-02  3.64110118e-02  8.49209351e-02
  -1.34620746e-02 -3.12381050e-03 -2.81130334e-02 -1.04714708e-02
  -1.55474428e-02  1.84881952e-02]
 [-4.85531491e-02 -5.06808273e-01  7.28491258e-02 -2.63233598e-02
   1.77288656e-01  2.41188785e-01  1.88030707e-01 -1.02985937e-01
  -4.72606264e-01  1.80479432e-01 -1.50068886e-01  3.49466416e-01
   4.20406568e-02  3.67689530e-01  1.64007707e-02 -5.65475282e-03
  -1.33649148e-01  2.08614687e-01]
 [-9.76630686e-02 -5.03512924e-01  3.97299160e-02  8.92406301e-02
  -1.18943843e-01  8.55695525e-02  1.79645105e-01 -5.68161240e-03
  -1.96900455e-01 -1.96479466e-02 -1.17042218e-01 -6.65433283e-01
   3.21189614e-04 -2.99921162e-01  1.48091845e-01  4.86753092e-02
   8.27214795e-02 -2.38469237e-01]]

 Eigen Values 
%s [9.84666540e+00 3.29980710e+00 1.20351991e+00 1.12794252e+00
 8.81333028e-01 6.66553473e-01 3.40154651e-01 2.28141119e-01
 1.20177988e-01 8.74702078e-02 6.49471642e-02 4.82952224e-02
 3.15861458e-03 3.10425921e-02 2.71046496e-02 1.03175532e-02
 1.77453419e-02 1.74946687e-02]
Cumulative Variance Explained [ 54.63730871  72.9473225   79.62542986  85.88417238  90.77452487
  94.4731057   96.36056038  97.62647287  98.2933181   98.77867396
  99.13905365  99.40703483  99.57928438  99.72968302  99.82814861
  99.92522326  99.98247344 100.        ]
In [46]:
 # Make a set of (eigenvalue, eigenvector) pairs:

eig_pairs = [(eig_vals[index], eig_vecs[:,index]) for index in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) pairs from highest to lowest with respect to eigenvalue
eig_pairs.sort()

eig_pairs.reverse()
print(eig_pairs)

# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eig_vals))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eig_vals))]

# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)
[(9.846665404618847, array([-0.27130979, -0.28444978, -0.30090997, -0.27577068, -0.10742544,
       -0.18759366, -0.3093001 ,  0.30762514, -0.30619055, -0.27458192,
       -0.30225868, -0.30604717, -0.25881899,  0.06121698, -0.03853509,
       -0.05969386, -0.04855315, -0.09766307])), (3.299807102376317, array([-0.0873255 ,  0.14559298, -0.03819217, -0.19407083, -0.24837842,
       -0.06898387,  0.07775906, -0.01888279,  0.08956041,  0.13602678,
        0.07276993,  0.0818148 ,  0.21860505,  0.50256696, -0.02768412,
       -0.0958106 , -0.50680827, -0.50351292])), (1.2035199065401396, array([ 0.04125257,  0.20174673, -0.07543313, -0.04155656,  0.09599067,
        0.11339008, -0.10934348,  0.0914628 , -0.10698675,  0.20602485,
       -0.13892106, -0.10895807,  0.21410656, -0.06806131,  0.54880158,
       -0.68130793,  0.07284913,  0.03972992])), (1.127942516794535, array([ 0.14544616, -0.02840057,  0.10429706, -0.24002378, -0.6062864 ,
        0.24720738,  0.0007455 ,  0.07045636,  0.02592699,  0.04664051,
       -0.05613655, -0.00377621, -0.07124901, -0.12270423,  0.52547064,
        0.40616268, -0.02632336,  0.08924063])), (0.881333027515427, array([ 0.16210669, -0.13135892, -0.07850656,  0.126098  ,  0.0678208 ,
       -0.70707553,  0.09179969, -0.08608371,  0.08724348, -0.25134064,
        0.15589482,  0.12768551, -0.00999171,  0.14223935,  0.48036224,
        0.07720261,  0.17728866, -0.11894384])), (0.6665534730091377, array([ 0.21844115, -0.02662964,  0.00207109, -0.15423575, -0.60405214,
       -0.25092654,  0.07824567, -0.06277851,  0.0854384 , -0.01103274,
        0.10401652,  0.10879451, -0.06339129, -0.15998165, -0.38086714,
       -0.46864812,  0.24118879,  0.08556955])), (0.3401546505699416, array([-0.2491598 ,  0.38860704, -0.10883281, -0.13918197, -0.06735121,
       -0.41336616, -0.09886102,  0.10074393, -0.09966791,  0.36338002,
       -0.10447011, -0.0902793 ,  0.45039092, -0.11171588, -0.12052056,
        0.31910583,  0.18803071,  0.17964511])), (0.2281411193843162, array([-0.75779078, -0.08709169,  0.30900045,  0.06166788, -0.14302176,
       -0.03441439,  0.08933968, -0.22049635,  0.03754287, -0.24437792,
        0.14911911,  0.05449454,  0.11712273, -0.30657687,  0.13313358,
       -0.13504787, -0.10298594, -0.00568161])), (0.12017798767277535, array([ 0.34644445,  0.05009643,  0.34070465,  0.16411644,  0.03219981,
       -0.23340357, -0.11395032,  0.25833592, -0.06028051, -0.11831112,
       -0.13155012, -0.1132585 ,  0.14491684, -0.5133818 , -0.06886203,
        0.00470633, -0.47260626, -0.19690046])), (0.08747020784078298, array([ 0.17579001, -0.15715599, -0.0826283 ,  0.02618407, -0.08517375,
        0.24801681, -0.10935103, -0.02739339, -0.18644282, -0.48789262,
        0.17748549, -0.12853757,  0.69366233,  0.11380814, -0.07275677,
        0.04708073,  0.18047943, -0.01964795])), (0.06494716421497158, array([-0.06123494,  0.04708794, -0.76366968,  0.15671004, -0.04446339,
        0.11656806,  0.1647476 ,  0.08279835,  0.26879376, -0.09069431,
       -0.02634279,  0.23194578,  0.07028421, -0.39860775,  0.01948074,
        0.03641101, -0.15006889, -0.11704222])), (0.04829522238977759, array([-0.00664587,  0.18695798, -0.04138615,  0.16603126, -0.07409287,
        0.09758942, -0.1061193 , -0.16999363, -0.29522111,  0.15281389,
        0.26270293, -0.2144806 , -0.193582  , -0.2130706 ,  0.00394362,
        0.08492094,  0.34946642, -0.66543328])), (0.031042592089074467, array([-0.11425038, -0.12398519,  0.23002358, -0.19087725,  0.13116202,
        0.0925202 , -0.01446712,  0.65089613,  0.30776386,  0.01926413,
        0.05062503,  0.3196245 ,  0.08221635, -0.02245924, -0.01888403,
       -0.00312381,  0.36768953, -0.29992116])), (0.027104649578985757, array([-1.28888167e-01,  2.47106065e-01,  2.56939245e-02,  7.24290568e-01,
       -3.24299748e-01,  2.36823290e-02, -7.80223561e-02,  3.94591981e-01,
       -5.98584544e-02, -8.44295973e-02,  6.10073550e-03,  1.12162737e-02,
       -1.01720531e-01,  2.92122936e-01, -3.40056176e-04, -2.81130334e-02,
        1.64007707e-02,  1.48091845e-01])), (0.01774534188319695, array([-0.02704931, -0.6658014 , -0.09132286,  0.20726491, -0.0578334 ,
       -0.05181248, -0.08067707,  0.11330707, -0.08254409,  0.54197225,
        0.3524156 , -0.0894769 ,  0.13974912, -0.04063825,  0.02057757,
       -0.01554744, -0.13364915,  0.08272148])), (0.01749466869925411, array([-0.01517011, -0.29003779,  0.08646941,  0.27887572, -0.11587554,
        0.00705205,  0.05645621, -0.25362957,  0.2828469 ,  0.13298953,
       -0.70825022, -0.12344395,  0.1730788 ,  0.05517552, -0.03364121,
        0.0184882 ,  0.20861469, -0.23846924])), (0.01031755315840211, array([-0.01060717,  0.10732566, -0.02549781, -0.0311026 ,  0.00919698,
       -0.00982184, -0.26955952,  0.00292063,  0.68350584, -0.06027679,
        0.25569618, -0.61032981, -0.04843109,  0.01746317,  0.00642458,
       -0.01047147, -0.00565475,  0.04867531])), (0.0031586145802925485, array([-1.12656376e-02, -8.59039936e-03,  9.15491318e-03, -4.01471990e-02,
        2.26294903e-02, -7.22747704e-03,  8.37869416e-01,  2.48548406e-01,
       -9.87703756e-02, -1.59153877e-02,  1.76529687e-02, -4.70438848e-01,
        9.90354787e-03,  7.27434469e-03, -2.21657093e-03, -1.34620746e-02,
        4.20406568e-02,  3.21189614e-04]))]
Eigenvalues in descending order: 
[9.846665404618847, 3.299807102376317, 1.2035199065401396, 1.127942516794535, 0.881333027515427, 0.6665534730091377, 0.3401546505699416, 0.2281411193843162, 0.12017798767277535, 0.08747020784078298, 0.06494716421497158, 0.04829522238977759, 0.031042592089074467, 0.027104649578985757, 0.01774534188319695, 0.01749466869925411, 0.01031755315840211, 0.0031586145802925485]
In [47]:
# Ploting 
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

Observations: From above we plot we can clealry observer that 8 dimension are able to explain 95 %variance of data.so we will use first 8 principal components going forward and calulate the reduced dimensions

In [48]:
pca = PCA(n_components=8)
pca.fit(df_scaled)
Out[48]:
PCA(copy=True, iterated_power='auto', n_components=8, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [49]:
df_pca=pca.transform(df_scaled)
In [50]:
pca.components_
Out[50]:
array([[-2.71309795e-01, -2.84449778e-01, -3.00909965e-01,
        -2.75770685e-01, -1.07425438e-01, -1.87593661e-01,
        -3.09300104e-01,  3.07625144e-01, -3.06190548e-01,
        -2.74581919e-01, -3.02258684e-01, -3.06047174e-01,
        -2.58818994e-01,  6.12169850e-02, -3.85350884e-02,
        -5.96938582e-02, -4.85531491e-02, -9.76630686e-02],
       [-8.73254992e-02,  1.45592977e-01, -3.81921704e-02,
        -1.94070827e-01, -2.48378425e-01, -6.89838664e-02,
         7.77590595e-02, -1.88827871e-02,  8.95604107e-02,
         1.36026776e-01,  7.27699345e-02,  8.18148047e-02,
         2.18605050e-01,  5.02566961e-01, -2.76841188e-02,
        -9.58106036e-02, -5.06808273e-01, -5.03512924e-01],
       [-4.12525682e-02, -2.01746727e-01,  7.54331346e-02,
         4.15565569e-02, -9.59906665e-02, -1.13390083e-01,
         1.09343480e-01, -9.14627983e-02,  1.06986751e-01,
        -2.06024846e-01,  1.38921058e-01,  1.08958069e-01,
        -2.14106558e-01,  6.80613078e-02, -5.48801579e-01,
         6.81307935e-01, -7.28491258e-02, -3.97299160e-02],
       [ 1.45446164e-01, -2.84005677e-02,  1.04297062e-01,
        -2.40023782e-01, -6.06286399e-01,  2.47207379e-01,
         7.45503616e-04,  7.04563606e-02,  2.59269906e-02,
         4.66405057e-02, -5.61365488e-02, -3.77621201e-03,
        -7.12490138e-02, -1.22704233e-01,  5.25470638e-01,
         4.06162675e-01, -2.63233598e-02,  8.92406301e-02],
       [-1.62106687e-01,  1.31358918e-01,  7.85065596e-02,
        -1.26097997e-01, -6.78208046e-02,  7.07075527e-01,
        -9.17996900e-02,  8.60837058e-02, -8.72434776e-02,
         2.51340642e-01, -1.55894820e-01, -1.27685507e-01,
         9.99171441e-03, -1.42239348e-01, -4.80362239e-01,
        -7.72026082e-02, -1.77288656e-01,  1.18943843e-01],
       [-2.18441148e-01,  2.66296378e-02, -2.07109277e-03,
         1.54235750e-01,  6.04052142e-01,  2.50926539e-01,
        -7.82456666e-02,  6.27785107e-02, -8.54384050e-02,
         1.10327428e-02, -1.04016523e-01, -1.08794505e-01,
         6.33912857e-02,  1.59981646e-01,  3.80867143e-01,
         4.68648122e-01, -2.41188785e-01, -8.55695525e-02],
       [ 2.49159797e-01, -3.88607045e-01,  1.08832809e-01,
         1.39181969e-01,  6.73512130e-02,  4.13366161e-01,
         9.88610199e-02, -1.00743931e-01,  9.96679104e-02,
        -3.63380015e-01,  1.04470107e-01,  9.02793020e-02,
        -4.50390921e-01,  1.11715878e-01,  1.20520564e-01,
        -3.19105830e-01, -1.88030707e-01, -1.79645105e-01],
       [-7.57790783e-01, -8.70916948e-02,  3.09000450e-01,
         6.16678808e-02, -1.43021765e-01, -3.44143860e-02,
         8.93396813e-02, -2.20496348e-01,  3.75428672e-02,
        -2.44377917e-01,  1.49119113e-01,  5.44945387e-02,
         1.17122732e-01, -3.06576865e-01,  1.33133576e-01,
        -1.35047870e-01, -1.02985937e-01, -5.68161240e-03]])
In [51]:
df_scaled.shape
Out[51]:
(824, 18)
In [52]:
df_pca.shape
Out[52]:
(824, 8)

7. Repeat steps 3,4 and 5 but this time, use Principal Components instead of the original data. And the accuracy score should be on the same rows of test data that were used earlier. (hint: set the same random state)

In [53]:
# Fitting SVM ON PCA Data
# Split the data into train and test

X1_train, X1_test, y1_train, y1_test = train_test_split(df_pca,y,test_size=0.30, random_state=10)
In [54]:
clr1 = svm.SVC(gamma='scale')  
clr1.fit(X1_train , y1_train)
Out[54]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [55]:
x_pca_predict = clr1.predict(X1_test)
In [56]:
print("Accuracy on training set: {:.4f}".format(clr1.score(X1_train, y1_train)))
print("Accuracy on test set: {:.4f}".format(clr1.score(X1_test, y1_test)))
Accuracy on training set: 0.9722
Accuracy on test set: 0.9153
In [57]:
model_score1 = clr1.score(X1_test, y1_test)
print(model_score1)
0.9153225806451613
In [58]:
tempResultsDf = pd.DataFrame({'Method':['svm(pca data)'], 'accuracy': [model_score1]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
Out[58]:
Method accuracy
0 svm(raw data) 0.661290
0 K-fold cross val(raw data) 0.956544
0 svm(pca data) 0.915323
In [59]:
# K-fold cross validation on PCA data
num_folds = 50
seed = 7

kfold1 = KFold(n_splits=num_folds, random_state=seed)
model1 = LogisticRegression()
results = cross_val_score(model1, df_pca, y, cv=kfold1)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.82352941 0.76470588 0.88235294 0.82352941 0.70588235 0.94117647
 0.76470588 0.88235294 0.82352941 0.82352941 0.82352941 1.
 0.88235294 0.82352941 0.88235294 0.82352941 0.70588235 0.88235294
 0.88235294 0.70588235 0.88235294 0.76470588 0.82352941 0.88235294
 0.8125     0.875      0.875      0.8125     0.875      0.8125
 0.875      0.8125     0.875      0.875      0.8125     0.8125
 0.75       0.8125     0.8125     0.9375     0.8125     0.9375
 1.         0.625      0.9375     0.75       0.875      0.8125
 0.9375     0.875     ]
Accuracy: 84.000% (7.304%)
In [60]:
kfmodel_score1 = results.mean()
print(kfmodel_score1)
0.84

8. Compare the accuracy scores and cross validation scores of Support vector machines – one trained using raw data and the other using Principal Components, and mention your findings (5 points)

In [61]:
tempResultsDf = pd.DataFrame({'Method':['K-fold cross val(pca data)'], 'accuracy':kfmodel_score1})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
Out[61]:
Method accuracy
0 svm(raw data) 0.661290
0 K-fold cross val(raw data) 0.956544
0 svm(pca data) 0.915323
0 K-fold cross val(pca data) 0.840000

Observations:

  1. On testing data set we saw that our support vector classifier without performing PCA has an accuracy score of 66 % But when we applied the SVC model on PCA componenets(reduced dimensions) our model scored 91 %.
  2. When we appliy k.fold crss validation on raw data we get 95% accuracy. 3..Considering that original dataframe had 18 dimensions and After PCA dimension reduced to 8, our model has fared well in terms of accuracy score. 4.By reducing dimension to 8 we get accuracy 84% with std (7.3%)